Abstract: Based on extensive field failure data for Tandem’s
GUARDIAN operating system, this paper discusses evaluation of 
the dependability of operational software. Software faults consid- 
ered are major defects that result in processor failures and invoke 
backup processes to take over. The paper categorizes the underly- 
ing causes of software failures and evaluates the effectiveness of 
the process pair technique in tolerating software faults. A model 
to describe the impact of software faults on the reliability of an 
overall system is proposed. The model is used to evaluate the sig- 
nificance of key factors that determine software dependability 
and to identify areas for improvement.

An analysis of the data shows that about 77% of processor 
failures that are initially considered due to software are con- 
firmed as software problems. The analysis shows that the use of 
process pairs to provide checkpointing and restart (originally 
intended for tolerating hardware faults) allows the system to tol- 
erate about 75% of reported software faults that result in proces- 
sor failures. The loose coupling between processors, which results 
in the backup execution (the processor state and the sequence of 
events) being different from the original execution, is a major 
reason for the measured software fault tolerance. Over two-thirds 
(72%) of measured software failures are recurrences of previ- 
ously reported faults. Modeling, based on the data, shows that, in 
addition to reducing the number of software faults, software de- 
pendability can be enhanced by reducing the recurrence rate. 

Index Terms — Measurement, fault categorization, software 
fault tolerance, recurrence, software reliability, operational 
phase, Tandem GUARDIAN System. 

I. Introduction 

This paper discusses evaluation of the dependability of operational
software based on measurements taken from the
Tandem GUARDIAN operating system. The Tandem 
GUARDIAN system is a commercial fault-tolerant system 
built for on-line transaction processing and decision support. 
The GUARDIAN operating system is a message-based operat- 
ing system that runs on a Tandem machine. Many studies have 
sought to improve the software development environment by 
using the failure data collected during the development phase 
[1], [2], [3]. The dependability issues for operational software
are typically very different from those for software under de- 
velopment, due to differences in the operational environment 
and software maturity. Also, the dependability of operational 
software needs to be investigated in the context of the overall 
system. 

A study of the dependability of operational software based 
on real measurements requires, in addition to instrumentation 
and data collection, an understanding of the system architec- 
ture, hardware, and software. It also requires an understanding 
of the development, service, and operational environments. 
Typically, measurement-based studies attempt to answer several
questions: What are the key failure modes and their significance,
how well do specific fault-tolerance techniques
work, and what is a realistic behavior model for the software 
and its associated parameters? This paper presents results 
based on field failure data collected from the Tandem 
GUARDIAN operating system. The data cover a period ex- 
tending over four months. The issues addressed include soft- 
ware fault categorization, an evaluation of the software fault 
tolerance of process pairs (a key hardware fault-tolerance 
technique used in Tandem systems), and evaluation of the im- 
pact of software faults on the overall system. 

The next section discusses related research. Section III in- 
troduces the Tandem GUARDIAN system and the measure- 
ments made. Section IV investigates the underlying causes 
(faults) that resulted in the observed software failures and 
categorizes the identified faults. The significance of failure 
recurrence is also discussed. Section V evaluates the software 
fault tolerance of process pairs. The reasons for achieving this 
software fault tolerance are investigated. This evaluation is 
important because, although process pairs are specific to Tan- 
dem systems, they are an implementation of the general ap- 
proach of checkpointing and restart. Section VI builds a model 
that describes the impact of faults in the GUARDIAN operat- 
ing system on the reliability of an overall Tandem system. A 
sensitivity analysis is conducted to evaluate the significance of 
the factors that determine software dependability and to iden- 
tify areas for improvement. Section VII summarizes the major 
conclusions of this study. 

II. Related Research

Software errors in the development phase have been exten- 
sively studied. Software error data collected from the DOS/VS 
operating system during the testing phase were analyzed in [4]. 
A wide-ranging analysis of software error data collected dur- 
ing the development phase was reported in [5]. An error 
analysis technique was used to evaluate software development
methodologies in [6]. Relationships between the frequency and
distribution of errors during software development, maintenance
of the developed software, and a variety of environmental
factors were analyzed in [7]. The orthogonal defect
classification, the use of observed software defects to provide
feedback on the development process, was proposed in [8].
These studies mainly attempt to fine-tune the software
development environment based on error analysis.

Software reliability modeling has also been studied extensively,
and many models have been proposed [1], [2], [3]. For the most
part, these models attempt to estimate the reliability of
software by analyzing the failure history of software during
the development phase, verification efforts, and operational
profiles.

Measurement-based analysis of operational software dependability
has also evolved over the past 15 years. An early study proposed
a workload-dependent probabilistic model for predicting software
errors based on measurements from a DEC system [9]. The effect
of workload on operating system reliability was analyzed using
the data collected from an IBM 3081 machine running VM/SP [10].
A Markov model to describe the software error and recovery
process in a production environment using error logs from the
MVS operating system was discussed in [11]. Software defects and
their impact on system availability were investigated using data
from the IBM MVS system in [12]. In [13], results from a census
of Tandem systems were presented. The data showed that software
was the major source (62%) of outages in the Tandem system.
Dependability and fault tolerance of three operating systems
(the Tandem GUARDIAN system, the IBM MVS system, and the VAX VMS
system) were analyzed using error logs in [14].

Software failures have also been studied from the software
fault-tolerance perspective. Two major approaches for software
fault tolerance, recovery blocks and N-version programming, were
proposed in [15], [16]. Dependability modeling and evaluation of
these two approaches were discussed in [17]. The effectiveness
of recovery routines in the MVS operating system was evaluated
using measurements from an IBM 3081 machine in [18]. Software
fault tolerance in the Tandem GUARDIAN operating system was
discussed in [19], [20]. Architectural issues for incorporating
hardware and software fault tolerance were discussed in [21].

III. Tandem System and Measurements

The Tandem GUARDIAN system is a message-based multiprocessor
system built for on-line transaction processing and decision
support [20]. A Tandem GUARDIAN system consists of two to 16
processors, dual interprocessor buses, dual-port device
controllers, input/output buses, and redundant power supplies
(Fig. 1). The key software components are processes and
messages. With a separate copy of the GUARDIAN operating system
running on each processor, these abstractions hide the physical
boundaries between processors and systems and provide a uniform
environment across a network of Tandem systems.


In the Tandem GUARDIAN system, a critical system function or
user application is replicated on two processors as the primary
and backup processes, i.e., as a process pair. Normally, only
the primary process provides service. The primary sends
checkpoints to the backup, so that the backup can take over the
function when the primary fails. The GUARDIAN system software
halts the processor it runs on when it detects nonrecoverable
errors. Nonrecoverable errors are a subset of exceptions in
privileged system processes. They are detected by the operating
system or explicit software checks made by privileged system
processes. The designer determines whether a specific exception
is nonrecoverable. The “I’m alive” message protocol allows the
other processors to detect the halt and to take over the
primaries that were running on the halted processor. With
multiple processors running process pairs, dual interprocessor
buses, dual-port device controllers, multiple I/O buses, disk
mirroring, and redundant power supplies, the system can tolerate
a single failure in a processor, bus, device
controller, disk, or power supply.
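
The checkpoint-and-takeover behavior of a process pair can be illustrated with a small sketch. This is a toy model written for this discussion, not Tandem code: the class names `Process` and `ProcessPair`, the dictionary-based state, and the `on_cpu_halt` callback (standing in for the detection of missing “I’m alive” messages) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One member of a process pair (hypothetical model)."""
    cpu: int
    state: dict = field(default_factory=dict)


class ProcessPair:
    """Toy primary/backup pair that checkpoints after every request."""

    def __init__(self, primary_cpu: int, backup_cpu: int):
        self.primary = Process(primary_cpu)
        self.backup = Process(backup_cpu)

    def serve(self, key: str, value: int) -> None:
        # Only the primary provides service ...
        self.primary.state[key] = value
        # ... and then sends a checkpoint of its state to the backup.
        self.backup.state = dict(self.primary.state)

    def on_cpu_halt(self, halted_cpu: int) -> bool:
        """Assumed hook: peers noticed that a CPU stopped sending messages.

        Returns True if the backup took over the primary's role.
        """
        if self.backup is None or halted_cpu != self.primary.cpu:
            return False
        # The backup resumes from the last checkpoint as the new primary.
        self.primary, self.backup = self.backup, None
        return True


pair = ProcessPair(primary_cpu=0, backup_cpu=1)
pair.serve("balance", 100)
took_over = pair.on_cpu_halt(0)
print(took_over, pair.primary.cpu, pair.primary.state)
```

In this sketch the backup holds only what was checkpointed, so anything the primary did after its last checkpoint is lost on takeover; that detail matters for the later discussion of why a backup may not hit the same fault.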



Fig. 1. Tandem GUARDIAN system architecture. 

In this paper, a software fault is a defect in the measured
software system, and a software failure is a processor failure
due to software. The terms processor halt and processor failure
are used interchangeably. Fig. 2 shows the software failure and
recovery process in the Tandem GUARDIAN system. When a fault in
the system software is exercised, an error (a first error) is
generated. Depending on the processor state, this error may
disappear or cause additional errors before being detected. The
impact of a detected error ranges from a minor cosmetic problem
at the user/system interface to a database corruption. A
software failure occurs when the system software detects
nonrecoverable errors and asserts a processor halt. Once a
software failure occurs, the system attempts to recover using
backup processes on other processors. If this recovery is
successful, the system can tolerate the software fault. The time
it takes for the system to detect a processor halt and for the
backup to attain the primary’s prefailure state depends on
several factors, such as the priority of the process, processor
configuration, and workload. The recovery usually takes about
10 seconds. If a job takeover is not successful or if a backup
process faces the same problem after a takeover, a double
processor halt occurs. Regardless of whether the recovery is
successful, the software fault is identified and a fix is made.
A single software fault can cause multiple software failures at
a single site or at multiple sites (“Recurrences” in
Fig. 2).



Fig. 2. Software failure and recovery in the Tandem GUARDIAN system. 


The human-generated software failure reports used in this
study were extracted from the Tandem Product Report (TPR)
database, a component of the Tandem Product Reporting System
(PRS). A TPR is used to report all problems, questions, and
requests for enhancements by users or Tandem employees
concerning any Tandem product. A TPR consists of a header and a
body. The header provides fixed fields for information such as
the date, problem type, urgency, user and system
identifications, and a brief problem description. The body of a
TPR is a textual description of all actions taken by Tandem
analysts in diagnosing the problem. If a TPR reports a software
failure, the body also includes the log of the memory dump
analyses performed by Tandem analysts. The information in a TPR
clearly indicates whether the incident was a software failure,
whether the underlying fault was fixed, and whether the TPR
shared the underlying fault with other TPRs. Two hundred TPRs
for the GUARDIAN operating system that cover a period extending
over four months in 1991 were used for this study.

IV. Fault Categorization

Several studies have performed fault categorization based
on faults identified during the development phase [4], [5], [7].
Software fault profiles in operational software can be quite
different, due to differences in the operational environment
and software maturity. We studied the underlying causes of
200 TPRs that reported processor failures seemingly due to
faults in the Tandem system software [24].

Fig. 3 shows a breakdown of the TPRs into three categories
(Software “Cause Identified,” Software “Cause Unidentified,”
and “Non-Software Problem”). Determining whether a failure
was caused by software faults is not straightforward, due partly
to system complexity and partly to close interactions between
the software and the hardware in the system. The only reliable
approach is to declare an incident to be a software problem
only after analysts have located a fault in the software,
reproduced the incident, and designed and tested a software fix.

Software causes were identified for 153 TPRs (“Cause
Identified”). If a TPR identified a fault in the software and
resulted in a software fix, the incident was counted as a
software problem, even if it was initially triggered by a
non-software cause (e.g., a hardware fault). In 26 TPRs (“Cause
Unidentified”), analysts believed that the underlying problems
were software faults, but they had not yet located the faults.
We use the term unidentified failures to refer to these cases.
The rest of the TPRs (“Non-Software Problem”) were due
mainly to hardware faults (e.g., a failure in power supply) or
operational faults (e.g., incorrectly specifying hardware
specifications in a system table). Note that 76.5% of the TPRs
that were initially classified as software problems were
confirmed as software problems, 13% of them were probably
software problems, and the rest (10.5%) were non-software
problems. The 179 TPRs (“Cause Identified” and “Cause
Unidentified”) formed the basis of our analysis. Fig. 3
specifies which groups of the TPRs were used to build the
subsequent tables.


[Figure: pie chart of TPR problem types, including Non-Software Problem (10.5%)]

Fig. 3. Problem types.

Table I shows the fault categories we selected in conjunc- 
tion with analysts. The table also shows the number of unique 
faults and the number of TPRs associated with each category. 
For example, the “Data fault” category contained 12 unique 
faults, and these faults caused 21 TPRs. Note that a single fault 
may recur and generate multiple TPRs, because many users 
run the same software. The 153 TPRs whose software causes 
were identified were due to 100 unique faults. 

A software failure caused by a newly found fault is referred 
to as a first occurrence; a software failure caused by a previously
reported fault is referred to as a recurrence. Recurrences
exist for several reasons. First, designing and testing a fix of a 
problem can take a significant amount of time. In the mean- 
time, recurrences can occur at the same site or at other sites. 
Second, the installation of a fix sometimes requires a planned 
outage, which may force users to postpone the installation and 
thus cause recurrences. Third, a purported fix can fail. Finally 
and probably most importantly, users who did not experience 
problems due to a certain fault often hesitate to install an 
available fix for fear that doing so will cause new problems. 
TABLE I

SOFTWARE FAULT CATEGORIZATION

Fault Category                                            #Faults   #TPRs
Incorrect computation                                         3        3
Data fault                                                   12       21
Data definition fault                                         3        7
Missing operation:                                           20       27
    Uninitialized pointers                                   (6)      (7)
    Uninitialized nonpointer variables                       (4)      (6)
    Not updating data structures on the occurrence
      of certain events                                      (6)      (9)
    Not telling other processes about the occurrence
      of certain events                                      (4)      (5)
Side effect of code update                                    4        5
Unexpected situation:                                        29       46
    Race/timing problem                                     (14)     (18)
    Errors with no defined error-handling procedures         (4)      (8)
    Incorrect parameters or invalid calls from user
      processes                                              (3)      (7)
    Not providing routines to handle legitimate but
      rare operational scenarios                             (8)     (13)
Microcode defect                                              4        8
Others (cause does not fit any of the above classes)         10       12
Unable to classify due to insufficient information           15       24
All                                                         100      153


Most of the categories in Table I are self-explanatory. 
“Incorrect computation” refers to an arithmetic overflow or the 
use of an incorrect arithmetic function (e.g., use of a signed 
arithmetic function instead of an unsigned one). “Data fault” 
refers to the use of an incorrect constant or variable. “Data 
definition fault” refers to a fault in declaring data or in defin- 
ing a data structure. “Missing operation” refers to an omission 
of lines of source code. “Side effect of code update” occurs
when not all dependencies between software modules were
considered when updating the software. “Unexpected situation”
refers to cases in which software designers did not anticipate
a legitimate operational scenario, and the software did
not handle the situation correctly. In the 24 TPRs we were 
“Unable to classify due to insufficient information,” analysts 
did not provide detailed information about the nature of the 
underlying faults. “Missing operation” and “Unexpected situa- 
tion” were the most common types of software faults in the 
measured software system. Additional code inspection and 
testing efforts can be used to identify such faults. 

Out of the 100 software faults observed during the meas- 
ured time window, 57 faults were diagnosed before the time 
window (i.e., were recurrences) and 43 were newly identified 
during the time window (i.e., were first occurrences). In other 
words, over two-thirds of the TPRs (72%; 110 out of 153) 
reported recurrences. When one considers that a single TPR 
may list a rapid succession of failures, which are likely to be 
caused by the same fault, the actual percentage of recurrences 
may be higher. 
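
The 72% figure follows directly from the counts in the text and can be checked with a few lines of arithmetic. The sketch below assumes, as the paragraph implies, that each of the 43 newly identified faults accounts for exactly one first-occurrence TPR and that every remaining TPR is a recurrence; the variable names are illustrative only.

```python
total_tprs = 153               # TPRs with identified software causes (Table I)
unique_faults = 100            # unique faults behind those TPRs
first_occurrence_faults = 43   # faults newly identified in the time window
known_faults = unique_faults - first_occurrence_faults  # diagnosed earlier

# Assumption: one first-occurrence TPR per newly identified fault;
# all other TPRs are counted as recurrences.
recurrence_tprs = total_tprs - first_occurrence_faults
recurrence_share = 100 * recurrence_tprs / total_tprs

print(f"faults diagnosed before the window: {known_faults}")
print(f"recurrence TPRs: {recurrence_tprs} of {total_tprs}")
print(f"recurrence share: {recurrence_share:.0f}%")
```

Note that the share is computed over TPRs, not over faults: by fault count, 57 of 100 faults were already known, while by TPR count the recurrences dominate more strongly.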

Recurrences are not unique to Tandem systems. Similar 
cases have been reported in IBM [25] and AT&T systems 
[26]. In environments where many users run (different versions 
of) the same software, the number of identified faults is not the
only factor determining software dependability. Recurrences
can seriously degrade software dependability in the field. In 
[25], a preventive software service policy that takes both the 
number of recurrences and the service cost into account was 
discussed. An approach for automatically diagnosing recur- 
rences based on symptoms was proposed in [27]. The issue of 
recurrence is discussed further in Section VI. 

V. Software Fault Tolerance Due to Process Pairs

In [13], [19], it was observed that process pairs allow the 
Tandem GUARDIAN system to tolerate certain software 
faults. That is, in many cases of processor halts due to software 
faults, the backup of a failed primary can continue the execu- 
tion. This observation is rather counterintuitive, because the 
primary and backup run the same copy of the software. The 
phenomenon is explained by the existence of subtle software 
faults that are not exercised again on a restart of the failed 
software. Usually, field software faults not identified during 
the testing phase are subtle and require very specific condi- 
tions to be triggered. Since the process pair technique was not 
explicitly intended for tolerating software faults, study of field 
data is essential for understanding this phenomenon and for 
measuring its effectiveness. 

This section investigates the user-perceived ability of the 
Tandem system to tolerate faults in its system software [24]. 
The software faults considered here are major defects that re- 
sult in processor failures. Although process pairs are specific 
to Tandem systems, they are an implementation of the general 
approach of checkpointing and restart. This evaluation is im- 
portant because it suggests that these may be low-cost tech- 
niques for achieving software fault tolerance in large, con- 
tinually evolving software systems. Attempts were recently 
made in [28], [29] to take advantage of the subtle nature of 
some software faults to enhance software fault tolerance in 
user applications. 

A. Measure of Software Fault Tolerance 

Table II shows the severity of the measured software fail- 
ures. In this table, a single processor halt implies that the built- 
in single-failure tolerance of the system masked the software 
fault that caused the halt. All multiple processor halts were 
grouped because, given the Tandem architecture, a double 
processor halt can potentially cause additional processor halts. 
For example, if the system loses a set of disks as a result of a 
double processor halt and the set of disks contains files re- 
quired by other processors, additional processor halts can oc- 
cur. There was one case in which a software failure occurred in 
the middle of a system reboot. Since each TPR reports just one 
problem, sometimes two TPRs were generated as a result of a 
multiple processor halt. There were five such cases. Thus, the 
179 TPRs reported 174 software failures. 
TABLE II

SEVERITY OF SOFTWARE FAILURES

Severity                                # Failures   Further Characterized in
Single processor halt                       138      Table IV
Multiple processor halt                      31      Table III
Halt occurring during system reboot           1      -
Unable to classify                            4      -
All                                         174



In this evaluation, the term software fault tolerance (SFT) 
refers to the system’s ability to tolerate software faults. Quanti- 
tatively, it is defined as 

\[
\mathrm{SFT} = \frac{\text{number of software failures in which a single processor is halted}}{\text{total number of software failures}}
\]

SFT represents the user-perceived ability of the system to tol- 
erate faults in its system software due to the use of process 
pairs. Table II shows that process pairs provide a significant 
level of software fault tolerance in the Tandem GUARDIAN 
environment. The measure of software fault tolerance is esti- 
mated to be 82% (138 out of 169, excluding the five special 
cases). 1 
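
As a worked check of this measure, the sketch below computes SFT from the Table II counts and then reproduces the footnoted estimate under underreporting. The function name `sft` and the variable names are illustrative. The underreporting scenario assumes that 80% of all failures go unreported and, as a simplification of the footnote's "most," that every unreported failure is a single processor halt.

```python
def sft(single_halts: float, total_failures: float) -> float:
    """Fraction of software failures confined to a single processor halt."""
    return single_halts / total_failures


# Reported failures (Table II, excluding the five special cases).
reported_single = 138
reported_multiple = 31
reported_total = reported_single + reported_multiple
reported_sft = sft(reported_single, reported_total)

# Footnote scenario. Assumptions: 80% of all failures are unreported and
# all unreported failures are single processor halts.
unreported_fraction = 0.80
estimated_total = reported_total / (1 - unreported_fraction)
estimated_sft = sft(estimated_total - reported_multiple, estimated_total)

print(f"reported SFT:  {reported_sft:.0%}")
print(f"estimated SFT: {estimated_sft:.0%}")
```

Under these assumptions the two values round to 82% and 96%, matching the figures quoted in the text and its footnote.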

B. Outages Due to Software 

This evaluation first focused on the multiple processor halts. 
For each multiple processor halt, we investigated the first two 
processor halts to determine whether the second halt occurred 
on the processor executing the backup of the failed primary 
process. In these cases, we also investigated whether the two 
processors halted because of the same software fault. 


TABLE III

Reasons for Multiple Processor Halts

Reasons for Multiple Processor Halts                          # Failures
The second halt occurs on the processor executing the
  backup of the failed primary.                                   24
    The second halt occurs due to the same fault that
      halted the primary.                                        (17)
    The second halt occurs due to another fault during
      job takeover.                                               (4)
    Unable to classify.                                           (3)
The second halt is not related to process pairs.                   4
    The system hangs.                                             (1)
    Faulty parallel software executes.                            (1)
    There is a random coincidence of two independent
      faults.                                                     (1)
    A single processor halt occurs, but system coldload
      is necessary for recovery.                                  (1)
Unable to classify.                                                3
All                                                               31


The level of software fault tolerance achieved with process
pairs is high, but not perfect: a single fault in the system
software can manifest itself as a multiple processor halt, which
the system is not designed to tolerate. Table III shows that in
86% of the multiple processor halts (24 out of 28, excluding
“Unable to classify” cases), the backup of the failed primary
process was unable to continue the execution. In 81% of these
halts (17 out of 21, excluding “Unable to classify” cases), the
backup failed because of the same fault that caused the failure
of the primary. In the remaining 19% of the halts, the
processor executing the backup of the failed primary halted
because of another fault during job takeover. About half of the
multiple processor halts resulted in system coldloads. (A system
coldload is a situation in which all processors in a system are
reloaded.) The data showed that, in most situations, the system
lost a set of disks that contained files required by other
processors as a result of the first two processor halts, and
other processors also halted. This sequence is the major failure
mode of the system resulting from software faults.

1 This measure is based on reported software failures. The issue
of underreporting was discussed in [13]. The consensus among
experienced Tandem engineers is that about 80% of software
failures are not reported as TPRs and that most of them are
single processor halts. If that assessment is true, then the
software fault tolerance may be as high as 96%.
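
The percentages quoted for Table III depend on which "Unable to classify" cases are excluded at each level, which is easy to misread. The short sketch below makes the two denominators explicit. The split of the 24 backup-related halts into 17, 4, and 3 is derived here from the "17 out of 21" and "remaining 19%" statements in the text, so treat it as an inference rather than a quoted table entry.

```python
multiple_halts = 31
top_level_unclassified = 3
backup_cpu_halts = 24     # second halt on the processor running the backup
same_fault = 17           # backup failed on the same fault as the primary

# Level 1: share of classified multiple halts that hit the backup's CPU.
classified = multiple_halts - top_level_unclassified          # 28
backup_share = 100 * backup_cpu_halts / classified

# Level 2: within those 24, the text uses 21 classified cases,
# which implies 3 further unclassified cases (derived, not quoted).
backup_classified = 21
backup_unclassified = backup_cpu_halts - backup_classified    # 3
other_fault = backup_classified - same_fault                  # 4
same_fault_share = 100 * same_fault / backup_classified
other_fault_share = 100 * other_fault / backup_classified

print(f"{backup_share:.0f}% {same_fault_share:.0f}% {other_fault_share:.0f}%")
```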

C. Characterization of Software Fault Tolerance 

The information in Table II raises the question of why the 
Tandem system lost only one processor in 82% of software 
failures and, as a result, tolerated the software faults that 
caused these failures. We identified the reasons for software 
fault tolerance (SFT) in all single processor halts (138 in- 
stances; refer to Table II) and classified them into several 
groups. Table IV shows that in 29% of single processor halts 
(40 out of 138), the fault that caused a failure of a primary 
process was not exercised again when the backup reexecuted 
the same task after a takeover. These situations occurred be- 
cause some software faults are exposed in a specific memory 
state (e.g., running out of buffer), on the occurrence of a single 
event or a sequence of asynchronous events during a vulner- 
able time window (timing), by race conditions or concurrent 
operations among multiple processes, or on the occurrence of a 
hardware error. 


TABLE IV

Reasons for Software Fault Tolerance

Reasons for Software Fault Tolerance                        Fraction (%)
The backup reexecutes the failed task after takeover,
  but the fault that caused a failure of the primary is
  not exercised by the backup.                                    29
    Memory state                                             [illegible]
    Timing                                                   [illegible]
    Race or concurrency                                      [illegible]
    Hardware error                                           [illegible]
    Others                                                   [illegible]
The backup, after takeover, does not automatically
  reexecute the failed task.                                      20
Effect of error latency.                                           5
A fault stops a processor running a backup.                       16
The cause of a problem is unidentified.                           19
Unable to classify.                                               12



Fig. 4 shows a real example of a fault that is exercised in a 
specific memory state. The primary of an I/O process pair, 
which is represented by SIOP(P) in the figure, requested a 
buffer to serve a user request. Because of the high activity in 
the processor executing the primary, the buffer was not available.
However, because of a software fault, the buffer management
routine returned a “successful” flag, instead of an
“unsuccessful” flag. The primary used the returned, uninitial- 
ized buffer pointer, and a halt occurred in the processor run- 
ning the primary because of an illegal address reference by a 
privileged process. Clearly, such a situation was not tested 
during the development phase. Since a memory dump is usu- 
ally taken only from a halted processor in a production system, 
a memory dump of the processor running the backup was not 
available. Our best guess is that the backup process served the 
request again after takeover but did not have a problem, because
a buffer was available on the processor running the backup.
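
The following toy simulation illustrates why such a fault can be masked by a takeover. It is a sketch of the scenario described above, not the actual GUARDIAN buffer manager: the `BufferPool` class, the `serve_request` function, and the use of Python's `None` to stand in for an uninitialized pointer are all assumptions made for illustration.

```python
class BufferPool:
    """Toy buffer manager containing the fault described in the text."""

    def __init__(self, free_buffers: int):
        self.free = free_buffers

    def allocate(self):
        if self.free > 0:
            self.free -= 1
            return True, bytearray(16)
        # Fault: should report failure; instead reports success and
        # hands back an unusable buffer reference.
        return True, None


def serve_request(pool: BufferPool) -> str:
    ok, buf = pool.allocate()
    if not ok:
        return "retry later"
    try:
        buf[0] = 1  # use the returned buffer
    except TypeError:
        # Stands in for the illegal address reference by a privileged process.
        return "processor halt"
    return "served"


primary_cpu = BufferPool(free_buffers=0)  # busy processor, no buffer left
backup_cpu = BufferPool(free_buffers=4)   # different memory state

print("primary:", serve_request(primary_cpu))
print("backup: ", serve_request(backup_cpu))
```

The same code runs on both processors, yet only the one whose pool is exhausted reaches the faulty branch. This is the sense in which loose coupling between processors gives the backup a different execution environment.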


[Figure: primary and backup executions on CPU A and CPU B]

Fig. 4. Differences between the primary and backup executions.


Table IV also shows that, in 20% of single processor halts 
(28 out of 138), the backup of a failed primary process did not 
have to serve the failed request after a successful takeover. 
This happened because some faults are exposed while serving 
requests that are important but are not automatically resubmit- 
ted to the backup upon a failure of the primary. Fig. 5 illus- 
trates an example of such a situation. In the figure, process PK 
is an execution of a utility to monitor processor activity for 
memory usage, message information, and paging activity. 
Process PK does not run as a process pair because, if the proc- 
essor being monitored halts while executing PK, there is no 
need to monitor the halted processor any longer. Process MS 
collects resource usage data, and process TM is in charge of 
concurrency control and failure recovery. Both MS and TM 
run as process pairs. 

When the operator ran PK with a certain option that is not 
frequently used, PK used an incorrect constant to initialize its 
data structure. As a result, it overwrote (cleared) the page ad- 
dresses of the first segment in the segment page table. The first 
segment is always owned by MS, and MS was running on the 
processor. When MS stored resource usage data, it used incor- 
rect addresses (addresses of zero) and corrupted the system 
global data. A processor halt occurred as a result of an address 
violation when TM accessed and used the address of a system 
data table. When the backups of the failed primaries took over, 
they did not have problems, because PK was running only on 
the halted processor. 

Another example is the faults that cause processor failures 
during the execution of the operator requests for reconfiguring 
I/O units. An I/O unit is a device or program by which an end-user
(a terminal operator, an application program, or an I/O
mechanism) gains access to the system. Utilities to perform 
these reconfigurations run as process pairs, but the operator 
command to add, activate, or abort an I/O unit is not automati- 
cally resubmitted to the backup, because it is an interactive 
task that can easily be resubmitted by the operator if the pri- 
mary fails. Suppose that an operator’s request to add an I/O 
unit caused a failure of the primary. In this situation, the opera- 
tor would typically recover the halted processor, rather than 
submit the same request to the backup. If the operator wants to 
repeat the same request, he or she would normally repeat it on 
the primary after the halted processor is reloaded. If the opera- 
tor submits the request to the backup instantly upon a failure of 
the primary, one of two situations can be expected: the backup 
also fails, or the backup serves the request without any prob- 
lem due to the factors in Table IV. 

In the above examples, the task (i.e., process PK or a com- 
mand to add an I/O unit) does not survive the failure. But 
process pairs allow the other applications on the halted proces- 
sor to continue to run. This situation is not strictly SFT but a 
side benefit of using process pairs. If these failures are excluded,
the estimated measure of SFT is adjusted to 78% (110
out of 141).



Fig. 5. Faults exposed by non-process pairs. 


Another reason for the SFT is that some software faults 
cause errors that are detected after the task that caused the 
errors finishes successfully (effect of error latency). Fig. 6 
shows an example. The figure shows a data transfer between 
two primary I/O processes: SIOP(P) and XIOP(P). The under- 
lying software fault was an extra line in the SIOP software that 
caused SIOP(P) to transfer one more byte than was necessary. 
This fault did not always cause a problem, because the size of 
a buffer is usually bigger than the size of a message. When a 
message and a buffer had equal sizes, the first byte in the end 
tag of the buffer was overwritten. This corruption did not af- 
fect the data transfer, because tags are not a part of data area. 
(The tags are used to check the integrity of a data structure, but
for performance reasons, they are not checked after every data
transfer.) The data transfer was successfully completed and
checkpointed to the backup. The corrupted buffer tag was not
a part of the checkpoint information. The corruption in the end
tag was found later, when SIOP(P) returned the buffer to the
buffer manager. The buffer manager checked the integrity of
the begin and end tags, found a corruption, and asserted a halt 
of the processor it runs on (“CPU A” in Fig. 6). The backups 
of the failed primaries would take over, but they would not 
have problems because the data transfer that caused the error 
was already completed successfully. The difference between 
this case and the first group of cases listed in Table IV is that 
the task that caused the failure of the primary did not have to 
be executed again in the backup. 
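
The latency mechanism in this example can be illustrated with a small Python sketch. The buffer layout, the tag values, and the function names are hypothetical and are not taken from the GUARDIAN software; the sketch only mimics the behavior described above: a copy that writes one byte too many, which is harmless unless the message exactly fills the buffer, and an integrity check that runs only when the buffer is returned.

```python
BEGIN_TAG = b"\xAA\xAA"   # hypothetical tag values
END_TAG = b"\x55\x55"


def allocate(size):
    """Return a buffer laid out as: begin tag | data area | end tag."""
    return bytearray(BEGIN_TAG + bytes(size) + END_TAG)


def faulty_transfer(buf, message):
    """Copy a message into the data area, writing one extra byte (the fault).

    Returns the data as seen by the receiver, which is unaffected.
    """
    start = len(BEGIN_TAG)
    payload = message + b"\x00"                 # the byte that should not be sent
    buf[start:start + len(payload)] = payload   # same-length slice assignment
    return bytes(buf[start:start + len(message)])


def release(buf):
    """Buffer-manager check on return: True if both tags are intact."""
    return buf[:len(BEGIN_TAG)] == BEGIN_TAG and buf[-len(END_TAG):] == END_TAG


# Message smaller than the buffer: the extra byte lands in the data area.
small = allocate(8)
small_ok = faulty_transfer(small, b"abc") == b"abc" and release(small)

# Message exactly fills the buffer: the transfer still succeeds, but the
# first byte of the end tag is overwritten and the damage is found at release.
full = allocate(8)
full_data_ok = faulty_transfer(full, b"12345678") == b"12345678"
full_tags_ok = release(full)
```

In the second case the transfer itself completes correctly, so a backup that takes over after the late halt has no reason to repeat it.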





Fig. 6. Effect of error latency. 

Table IV also shows that 16% of single processor halts (22
out of 138) were failures of backup processes. This result
indicates that the SFT did not come without a cost; the added
complexity due to the implementation of process pairs
introduced software faults into the system software. The estimated
measure of SFT (78%) can be adjusted again to 74% (88 out
of 119) when these failures are excluded. All unidentified
failures were single processor halts. This is understandable,
because these failures were caused by subtle faults that are
difficult to observe and diagnose. The reason that an unidentified
failure caused a single processor halt is unknown. Based on
their symptoms, we speculate that a significant number of
unidentified failures were single processor halts because of the
effect of error latency.
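
The successive adjustments of the SFT estimate are simple ratios and can be checked with a few lines of Python. The counts are the ones quoted in the text; the helper name `pct` is only illustrative.

```python
def pct(numerator, denominator):
    """Percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


backup_share = pct(22, 138)            # backup failures among single processor halts
sft_estimate = pct(110, 141)           # estimate before excluding backup failures
sft_adjusted = pct(110 - 22, 141 - 22) # 88 out of 119, backup failures excluded

print(backup_share, sft_estimate, sft_adjusted)   # prints: 16 78 74
```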

D. Discussion 

The results in this section have several implications. First, 
the results show that hardware fault tolerance buys SFT. The 
use of process pairs in Tandem systems, which was originally 
intended for tolerating hardware faults, allows the system to 
tolerate about 75% of reported field faults in the system soft- 
ware that cause processor failures. Subtle faults exist in all 
software, but SFT is not achieved if the backup execution is a 
replication of the original execution. The loose coupling be- 
tween processors, which results in the backup execution (the 
processor state and the sequence of events occurring) being 
different from the original execution, is a major reason for the 
measured SFT. Each processor in a Tandem system has an 
independent processing environment; therefore, the system 
naturally provides such differences. (The advantages of using
checkpointing, as compared with lock-step operation, in
tolerating software faults were discussed in [19].) The level of
SFT achieved by the use of process pairs will depend on the 
proportion of subtle faults in software. While process pairs 
may not provide perfect SFT, the implementation of process 
pairs is not as prohibitively expensive as is developing and 
maintaining multiple versions of large software programs. 

Second, the results indicate that process pairs can some- 
times allow the system to avoid multiple processor halts due to 
software faults, regardless of the nature of the faults, because 
software failures can occur while the system executes impor- 
tant tasks that are not automatically resubmitted to the backup 
on a failure of the primary. In such a case, the failed task does 
not survive, but the other applications on the failed processor 
do. 

Third, short error latency with error confinement within a 
transaction is desirable [30]. In actual designs, such a strict 
error confinement might be rather difficult to achieve. Errors 
generated during the execution of a transaction may be de- 
tected during the execution of another transaction. Interest- 
ingly, long error latency and error propagation across transac- 
tions sometimes help the system to tolerate software faults. 
This result should not be interpreted to suggest that long error 
latency or error propagation across transactions is a desirable 
characteristic. It is a side effect of the system having subtle 
software faults. Long error latency and error propagation 
across transactions can make both on-line recovery and off- 
line diagnosis difficult. 

Finally, an interesting question is: If process pairs are good, 
are process triples better? Our results show that process triples 
may not necessarily be better, because the faults that cause 
double processor halts with process pairs may cause triple 
processor halts with process triples. 

E. First Occurrences vs. Recurrences 

Table V compares the severity of the three types of software 
failures using the 174 software failures discussed in this sec- 
tion. There were two special cases (“Others”) in the table: a 
multiple processor halt due to a parallel execution of faulty 
code (a system coldload was not required) and a software fail- 
ure during a system reboot. With only a single observation in 
each case, the significance of these situations was unclear, and 
they were not considered in the subsequent analysis. “Severity 
Unclear” cases were also not considered further. 


TABLE V

Severity of Software Failures by Failure Type

                    #Failure    #Double     #System     #Severity
Failure Type        Instances   CPU Halts   Coldloads   Unclear     #Others
First occurrence       41           9           6           1          1
Recurrence            107          19          12           3          1
Unidentified           26           0           0           0          0


Table V indicates that a recurrence is slightly less likely than
a first occurrence to cause a double processor halt. The binomial
test was used to test this observation, because it does not
require an assumption about the underlying distribution to
construct a confidence interval [31]. Each failure was treated
as a random trial with the probability of a double processor
halt being 0.23 (nine out of 39, following the statistics for the
first occurrence). The hypothesis that the probability of a
recurrence causing a double processor halt is equal to that of a
first occurrence causing a double processor halt was tested by
calculating the probability of having 19 or fewer double
processor halts out of 103 trials. The p-value was 0.16; that is, the
hypothesis was rejected at the 20% significance level.
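
The one-sided binomial tail described above can be reproduced with the Python standard library. This is a minimal sketch of the calculation as stated in the text (19 or fewer double processor halts in 103 trials, with the per-trial probability taken from the first-occurrence statistics); the function name is illustrative.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


p_first = 9 / 39                        # double-halt probability, first occurrences
p_value = binomial_cdf(19, 103, p_first)
print(f"p-value = {p_value:.2f}")
```

The result should land close to the reported value of 0.16, which is below 0.20 and therefore consistent with rejection at the 20% level.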

Two of the six system coldloads due to first occurrences 
were single processor halt situations. These two failures cap- 
ture the secondary failure mode of the system due to software, 
wherein a system is coldloaded to recover from a severe, sin- 
gle processor halt. 


VI. Reliability Modeling of Operational Software

Software reliability models attempt to estimate the reliability
of software. Many models have been proposed [1], [2], [3].
These models typically attempt to relate the history of fault
identification during the development phase, verification
efforts, and operational profile. The primary focus is on the
software development phase, and the underlying assumptions
are that software is an independent entity and that each
software fault has the same impact.

The results from the previous sections indicated that other 
factors significantly impact the dependability of operational 
software. First, software faults can be highly visible or less 
visible. A single, highly visible software fault can cause many 
field failures, and recurrences can seriously degrade software 
dependability in the field. Second, for a class of software such 
as the GUARDIAN operating system, the fault tolerance of the 
overall system can significantly improve software dependabil- 
ity by making the effects of software faults invisible to users. 
Clearly, dependability issues for operational software in gen- 
eral can be quite different from those for the software in the 
development phase. Discussion of software reliability in the 
system context was provided in [32]. An approximate model to 
account for failures due to design faults was used to evaluate 
the dependability of operational software in [33]. The use of 
information on system usage (i.e., installation trail) to predict 
software reliability and to determine test strategy was
discussed in [34].

This section asks the question: Which factors determine the
dependability of the measured operating system? Using the
software failure and recovery characteristics identified in the 
previous sections, this section builds a model to describe the 
impact of faults in the GUARDIAN operating system on the 
reliability of an overall Tandem system. Based on the model, 
the section conducts a sensitivity analysis to evaluate the sig- 
nificance of the factors considered and to identify areas for 
improvement. 

A. Model Construction 

We considered a hypothetical eight-processor Tandem system
whose software reliability characteristics are described by
the parameters in Table VI. In this analysis, the term software
reliability means the reliability of an overall system when only
the faults in the system software that cause processor failures
are considered. A system failure was defined to occur when
more than half the processors in the system failed. All
parameters in the table except λ and μ were estimated based on the
measured data (Sections IV and V). The values of λ and μ
were determined to mimic the 30 years of software mean time
between failure (MTBF) and the mean time to repair (MTTR)
characteristics reported in [13]. Thus, the objectives of the
analysis were to model and evaluate reliability sensitivity to
various factors, not to estimate the absolute software
reliability.


TABLE VI

Estimated Software Reliability Parameters

Failures:
                                          First
                                          Occurrence     Recurrence     Unidentified
Failure rate                              λ_f = 0.24λ    λ_r = 0.61λ    λ_u = 0.15λ
Prob(double CPU halt|software failure)    C_df = 0.23    C_dr = 0.18    C_du = 0.0
Prob(system failure|double CPU halt)      C_sdf = 0.44   C_sdr = 0.63   C_sdu = 0.0
Prob(system failure|single CPU halt)      C_ssf = 0.05   C_ssr = 0.0    C_ssu = 0.0

Software failure rate = λ = 0.32/year

Recovery:
Recovery rate = μ = 3/hour

In Table VI, “Prob(double CPU halt|software failure)” is the 
probability that a double processor halt (i.e., a failure of a 
process pair) occurs given that a software failure occurs. 
Similarly, “Prob(system failure|double CPU halt)” is the prob- 
ability that a system failure occurs given that a double proces- 
sor halt occurs. These two parameters were used to describe 
the major failure mode of the system because of software. The 
parameter “Prob(system failure|single CPU halt)” represents 
the secondary failure mode, which captures single processor 
halts severe enough to cause system coldloads. The table 
shows these probabilities for first occurrences, recurrences, 
and unidentified failures. 

Based on the parameters in Table VI and on the following 
assumption, we built a continuous-time Markov model to de- 
scribe the software failure and recovery in a hypothetical 
eight-processor Tandem system in the field. 

ASSUMPTION 1. The time between software failures in the
system has an exponential distribution, and the three types of
failures (first occurrence, recurrence, and unidentified) are
randomly mixed.

This assumption was necessary, because determining the
above characteristics for a single system would require a
minimum of a few hundred years of measurements. The
assumption could not be validated using the measured data,
because the measured data was collected from a large number of
user systems running different versions of the operating system
and having different operational environments and system
configurations. Given this situation, the assumption seemed
reasonable.

Fig. 7 shows the Markov model. In the model, S_i, i = 0, ..., 4,
represents that i processors are halted because of software
faults. A system failure is represented by the S_down state. To
evaluate software reliability, no recovery from a system failure
was assumed. That is, the system failure state is an absorbing
state. The R_i state represents an intermediate state in which the
system tries to recover from an additional software failure (ith
processor halt) using process pairs.



Fig. 7. Software reliability model. 


If a software failure occurs during the normal system
operation (i.e., when the system is in the S_0 state), the system enters
the R_1 state. If the failure is severe enough to cause a system
coldload, a system failure occurs; otherwise, the system
attempts to recover from the failure by using backups. If
recovery is successful, the system enters the S_1 state; otherwise, a
double processor halt occurs. If the two halted processors
control key system resources (such as a set of disks) that are
essential for system operation, the rest of the processors in the
system also halt and a system failure occurs; otherwise, the
system enters the S_2 state and continues to operate. The value
of r, the transition rate out of an R_i state, is small and has virtually
no impact on software reliability; a value of one transition per
minute was used in the analysis. Since the system stays in an R_i
state for a short time, additional failures occurring in an R_i
state were ignored; in fact, these failures were implied in the
failure rate (λ) in the corresponding S states. Given
the model in Fig. 7, software reliability of the system can be
estimated by calculating the distribution of time for the system
to be absorbed to the S_down state, starting from the S_0 state.

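In Fig. 7, the three coverage parameters C_d, C_sd, and C_ss
were calculated from Table VI:

\[
C_d = \text{Prob(double CPU halt} \mid \text{software failure)}
    = \frac{\lambda_f C_{df} + \lambda_r C_{dr} + \lambda_u C_{du}}
           {\lambda_f + \lambda_r + \lambda_u}, \tag{2}
\]

\[
C_{sd} = \text{Prob(system failure} \mid \text{double CPU halt)}
    = \frac{\lambda_f C_{df} C_{sdf} + \lambda_r C_{dr} C_{sdr} + \lambda_u C_{du} C_{sdu}}
           {\lambda_f C_{df} + \lambda_r C_{dr} + \lambda_u C_{du}}, \tag{3}
\]

and

\[
C_{ss} = \text{Prob(system failure} \mid \text{single CPU halt)}
    = \frac{\lambda_f C_{ssf} + \lambda_r C_{ssr} + \lambda_u C_{ssu}}
           {\lambda_f + \lambda_r + \lambda_u}. \tag{4}
\]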
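
As a numerical check, Equations (2), (3), and (4) can be evaluated directly from the Table VI entries. The sketch below expresses the three failure rates as fractions of λ, which cancels in all three ratios; the dictionary and function names are illustrative and not part of the original analysis.

```python
# Table VI values; failure rates are given as fractions of the total rate.
RATE = {"first": 0.24, "recurrence": 0.61, "unidentified": 0.15}
C_D  = {"first": 0.23, "recurrence": 0.18, "unidentified": 0.0}
C_SD = {"first": 0.44, "recurrence": 0.63, "unidentified": 0.0}
C_SS = {"first": 0.05, "recurrence": 0.0,  "unidentified": 0.0}


def coverage(rate, c_d, c_sd, c_ss):
    """Return (C_d, C_sd, C_ss) according to Equations (2), (3), and (4)."""
    total = sum(rate.values())
    double = sum(rate[k] * c_d[k] for k in rate)
    fatal_double = sum(rate[k] * c_d[k] * c_sd[k] for k in rate)
    severe_single = sum(rate[k] * c_ss[k] for k in rate)
    return double / total, fatal_double / double, severe_single / total


cd, csd, css = coverage(RATE, C_D, C_SD, C_SS)
print(f"C_d = {cd:.3f}, C_sd = {csd:.3f}, C_ss = {css:.3f}")
```

The values come out near 0.165, 0.57, and 0.012, close to the ratios that can be read directly from the Table V counts.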


The parameter C_d includes the two cases explained in
Section V: the failure of a process pair caused by a single software
fault and the failure of a process pair caused by two software
faults (the second halt occurs during job takeover). The
parameter C_sd represents the probability that the system loses key
system resources as a result of a double processor halt. The
parameter C_sd is determined primarily by the system
configuration and is discussed further in Section VI.D. The above
three parameters can actually be obtained directly from Table
V in Section V.E. Equations (2), (3), and (4) will be used to
investigate the impact of recurrences (λ_r) on software
reliability in Section VI.B.

The model (Fig. 7) includes the effect of multiple independent
software failures. For example, if a software failure occurs
when the system is in the S_i state (i ≠ 0), the following three
system failure scenarios must be considered (Fig. 8):

1) The system fails regardless of whether the new failure
causes a single or double processor halt. This is because 
when the first processor halts because of the new failure, 
key system resources (such as a set of disks) become in- 
accessible. 

2) The system fails because the new failure is severe and 
can only be recovered by a system coldload. 

3) The new software failure causes a double processor halt, 
and the second processor halt causes a set of disks to be- 
come inaccessible. 



Fig. 8. Effect of multiple independent software failures.


It was not possible to directly measure the branching
probabilities in Fig. 8 for each state from the data, because the
major failure mode (i.e., a software failure occurred when the
system is in the S_0 state, causing a double processor halt and
subsequently causing a system failure) was dominant. These
probabilities were estimated using the three measured
parameters: C_d, C_sd, and C_ss. Table VII shows the branching
probabilities in Fig. 8 estimated for each S_i (i ≠ 0) state. For
example, given that an additional software failure causes a double
processor halt when the system is in the S_1 state, the
probability that the third processor halt does not cause a system failure
(path D in Fig. 8) is (1 − C_sd)^2. This is because the probability
that the third processor halted and either of the two processors
that were already halted control key system resources (i.e.,
cause a system failure) is C_sd. The branching probabilities in
Table VII were used to determine the corresponding transition
rates in the model (Fig. 7).

TABLE VII

Parameters for Multiple Independent Software Failures

The same recovery rate was used regardless of the number 
of processors halted. This was because the recovery time is 
typically determined by the time required to perform a quick 
diagnosis and take a memory dump, which is done for one 
processor at a time. Previous studies assumed that the failure 
rate is proportional to the number of processors up and
working [35]. The same software failure rate was assumed in all
states, considering that, as more processors halt, the remaining 
processors will receive more stress. Again, the dominance of 
the major system failure mode did not allow us to estimate the 
parameters from the data. 

The distribution of time for the system to be absorbed to the 
system failure state, starting from the normal state, was evalu- 
ated using the model in Fig. 7. SHARPE [36] was used for the 
evaluation. Fig. 9 shows the software reliability curve of the 
modeled system and confirms the assumed software MTBF of 
30 years. The figure represents the reliability of an overall 
Tandem system in the field when only the faults in the system 
software that caused processor failures were considered. 
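
The 30-year figure can be cross-checked with a back-of-the-envelope calculation that does not use the full model of Fig. 7 or SHARPE. The sketch below makes two simplifying assumptions that are not from the paper: (a) a software failure arriving in the S_0 state is fatal with probability C_ss + (1 − C_ss) C_d C_sd, which is one reading of the transitions out of R_1; and (b) in the three-state variant, every non-fatal failure leads to a single degraded state in which any further failure is treated, pessimistically, as a system failure. The coverage values are those obtained from Equations (2), (3), and (4) applied to Table VI.

```python
LAMBDA = 0.32            # software failure rate per year (Table VI)
MU = 3 * 24 * 365        # recovery rate per year (3 per hour in Table VI)
C_D, C_SD, C_SS = 0.165, 0.566, 0.012   # Equations (2)-(4) applied to Table VI


def p_fatal(c_d, c_sd, c_ss):
    """Assumed probability that one failure in S_0 brings the system down."""
    return c_ss + (1 - c_ss) * c_d * c_sd


def mtbf_first_order(lam, c_d, c_sd, c_ss):
    """MTBF when multiple independent failures are ignored altogether."""
    return 1.0 / (lam * p_fatal(c_d, c_sd, c_ss))


def mtbf_three_state(lam, mu, c_d, c_sd, c_ss):
    """Mean time to absorption for a truncated chain S_0, S_1, S_down.

    Solves  T0 = 1/lam + (1 - p) * T1
            T1 = 1/(mu + lam) + mu/(mu + lam) * T0
    where any failure that arrives while in S_1 is counted as fatal.
    """
    p = p_fatal(c_d, c_sd, c_ss)
    back = mu / (mu + lam)
    return (1.0 / lam + (1 - p) / (mu + lam)) / (1 - (1 - p) * back)


approx = mtbf_first_order(LAMBDA, C_D, C_SD, C_SS)
bound = mtbf_three_state(LAMBDA, MU, C_D, C_SD, C_SS)
slow = mtbf_three_state(LAMBDA, MU / 10, C_D, C_SD, C_SS)
print(f"{approx:.1f} {bound:.1f} {slow:.1f}")
```

Both estimates are close to 30 years, and slowing recovery by a factor of ten changes the result only marginally, which agrees with the observation that the major failure mode dominates.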





B. Reliability Sensitivity Analysis 

Table VIII shows the six factors considered in the analysis.
The second column of the table shows activities related to
these factors, and the third column shows the model
parameters affected by the factors. For example, a 10% reduction in
the recurrence rate (λ_r), which can be achieved by improving
the software service environment, will reduce λ by 6.1%
(Table VI) and change C_d, C_sd, and C_ss accordingly. Refer to
Equations (2), (3), and (4).

The coverage parameters C_d and C_sd are determined
primarily by the robustness of process pairs and the system
configuration, respectively. For example, C_d can be reduced by
conducting extra testing of the routines related to job takeover.
The parameter C_sd is determined by the location of failed
process pairs and the disk subsystem configuration. This parameter
is discussed further in Section VI.D. Analytical models for
predicting coverage in a fault-tolerant system and the
sensitivity of system reliability/availability to the coverage parameter
were discussed in [37]. The recovery rate μ can be improved
by automating the data collection and reintegration process.


TABLE VIII

Factors of Software Reliability

                                                            Related Parameters
Factor                     Activity                         Detailed               Overall
Software failure rate      Software development             λ_f, λ_r, λ_u          λ
Recurrence rate            Software service                 λ_r                    λ, C_d, C_sd, C_ss
Coverage parameter C_d     Robustness of process pairs      C_df, C_dr, C_du       C_d
Coverage parameter C_sd    System configuration             C_sdf, C_sdr, C_sdu    C_sd
Coverage parameter C_ss                                     C_ssf, C_ssr, C_ssu    C_ss
Recovery time              Diagnosability/maintainability


Fig. 10 shows the software MTBF evaluated using the
model in Fig. 7 while varying the six factors in Table VIII, one
at a time. It is interesting to see that C_d and C_sd are almost as
important as λ in determining the software MTBF. For
example, a 20% reduction in C_d or C_sd has as much impact on
software MTBF as an 18% reduction in λ. (The figure shows that
the impact is approximately a 20% increase in software
MTBF.) This result is understandable because the system fails
primarily because of a double processor halt causing a set of
disks to become inaccessible, not because of multiple
independent software failures.

Fig. 10 also shows that the recurrence rate has a significant
impact on software reliability. A complete elimination of
recurrences (λ_r = 0 in Table VI) would increase the software
MTBF by a factor of three. The impact of C_ss on software
reliability is small, because severe, single processor halts causing
system coldloads are rare. The impact of μ on software MTBF
is virtually nil. In other words, recovery rate is not a factor as
far as software reliability is concerned, again, because the
system is unlikely to fail because of multiple independent
software failures.

Fig. 10. Software MTBF sensitivity.

Typically, it is assumed that the number of faults in software 
is the only major factor determining software reliability. 
Fig. 10 clearly shows that in the Tandem system, there are four 
degrees of freedom in improving the software reliability: the 
number of faults in software, the recurrence rate, the robust- 
ness of process pairs, and the system configuration strategy. 
The first two are general factors, and the last two are platform- 
dependent factors. Efforts to improve software reliability can 
be optimized by estimating the cost of improving each of these 
four factors. 
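
The effect of the recurrence rate reported in this section can be approximated without the full Markov model. The sketch below is a first-order estimate under an assumption that is not from the paper: a software failure is taken to be fatal with probability C_ss + (1 − C_ss) C_d C_sd, and multiple independent failures are ignored. The recurrence rate is scaled, the coverage parameters are recomputed from Equations (2), (3), and (4), and the resulting MTBF is compared with the baseline. All names are illustrative.

```python
BASE_RATE = 0.32   # software failures per year (Table VI)
FRACTION = {"first": 0.24, "recurrence": 0.61, "unidentified": 0.15}
C_D  = {"first": 0.23, "recurrence": 0.18, "unidentified": 0.0}
C_SD = {"first": 0.44, "recurrence": 0.63, "unidentified": 0.0}
C_SS = {"first": 0.05, "recurrence": 0.0,  "unidentified": 0.0}


def mtbf(recurrence_scale=1.0):
    """First-order MTBF in years with the recurrence rate scaled by a factor."""
    rate = {k: BASE_RATE * f for k, f in FRACTION.items()}
    rate["recurrence"] *= recurrence_scale
    lam = sum(rate.values())
    double = sum(rate[k] * C_D[k] for k in rate)
    fatal_double = sum(rate[k] * C_D[k] * C_SD[k] for k in rate)
    severe_single = sum(rate[k] * C_SS[k] for k in rate)
    c_d = double / lam                 # Equation (2)
    c_sd = fatal_double / double       # Equation (3)
    c_ss = severe_single / lam         # Equation (4)
    p_fatal = c_ss + (1 - c_ss) * c_d * c_sd   # assumed first-order fatality
    return 1.0 / (lam * p_fatal)


baseline = mtbf()
no_recurrence = mtbf(0.0)
ten_percent_less = mtbf(0.9)
print(f"gain without recurrences: {no_recurrence / baseline:.1f}x")
```

Under these assumptions, eliminating recurrences raises the estimated MTBF by a factor close to three, in line with the result read from Fig. 10.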

C. Reliability Sensitivity to Fault Category 

This section investigates the impact of software faults in dif- 
ferent fault categories (Table I in Section IV) on software reli- 
ability. In this section, a failure group is defined as the group 
of software failures caused by all faults that belong to a fault 
category. We estimated the software MTBF by assuming that 
each failure group is empty, i.e., the faults in a fault category 
did not cause software failures. The failure rate and the cover- 
age parameters for the model in Fig. 7 were adjusted: 

\[
\lambda = \frac{\text{total no. of software failures} - \text{no. of software failures in a failure group}}
               {\text{total no. of software failures}} \, \lambda, \tag{5}
\]

\[
C_d = \frac{\text{total no. of double CPU halts} - \text{no. of double CPU halts in a failure group}}
           {\text{total no. of software failures} - \text{no. of software failures in a failure group}}, \tag{6}
\]

\[
C_{sd} = \frac{\text{total no. of system failures} - \text{no. of system failures in a failure group}}
              {\text{total no. of double CPU halts} - \text{no. of double CPU halts in a failure group}}, \tag{7}
\]

and

\[
C_{ss} = \frac{\text{total no. of severe, single CPU halts} - \text{no. of severe, single CPU halts in a failure group}}
              {\text{total no. of software failures} - \text{no. of software failures in a failure group}}. \tag{8}
\]

On the right-hand side of Equation (5), λ denotes the unadjusted
software failure rate.

In Equation (7), only those system failures caused by double
processor halts (i.e., failures of process pairs) were counted.

Table IX shows the results. The last column of the table
shows the improvement in software MTBF when failures
caused by each fault category were eliminated. Only those
categories that have more than 10 failures were considered.
The table shows that “Missing operation” caused the greatest
reliability loss. Further analysis showed that uninitialized
pointers (Table I in Section IV) were responsible for more
than half of the loss caused by this group of failures. The table
also shows that “Unexpected situation” was another significant
source of reliability loss. Most of this loss is attributed to faults
such as incorrect parameters passed by user processes, illegal
procedure calls made by user processes, and not considering
all legitimate operational scenarios in designing software. (The
reliability loss is not attributed to subtle faults, such as race
conditions and timing problems.) Additional code inspection
and testing efforts can be directed to these fault categories.
Unidentified failures had virtually no impact on software
reliability, because all of these failures caused single processor
halts.
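
The adjustment in Equations (5) through (8) amounts to subtracting a failure group's counts from the totals and recomputing the four model parameters. The sketch below shows that bookkeeping. The `Counts` structure and the numbers in the example are hypothetical; the per-category counts of double processor halts and system failures are not listed in this section, so the example values are for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    failures: int             # software failures
    double_halts: int         # double CPU halts
    system_failures: int      # system failures caused by double CPU halts
    severe_single_halts: int  # severe, single CPU halts


def adjusted_parameters(lam, total, group):
    """Model parameters with one failure group removed, per Equations (5)-(8)."""
    rest_failures = total.failures - group.failures
    rest_double = total.double_halts - group.double_halts
    return {
        "lambda": lam * rest_failures / total.failures,
        "C_d": rest_double / rest_failures,
        "C_sd": (total.system_failures - group.system_failures) / rest_double,
        "C_ss": (total.severe_single_halts - group.severe_single_halts)
                / rest_failures,
    }


# Hypothetical counts, chosen only to exercise the formulas.
total = Counts(failures=100, double_halts=20, system_failures=10,
               severe_single_halts=2)
group = Counts(failures=25, double_halts=8, system_failures=5,
               severe_single_halts=0)
params = adjusted_parameters(0.32, total, group)
```

The adjusted parameters would then be substituted into the model of Fig. 7 to obtain the improved MTBF for that category.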


TABLE IX

Reliability Sensitivity to Fault Category

                                           MTBF improved /
Fault Category                #Failures    MTBF current
Incorrect computation              3            -
Data fault                        21           1.00
Data definition fault              7            -
Missing operation                 27           1.47
Side effect of code update         5            -
Unexpected situation              46           1.35
Microcode defect                   8            -
Others                            12           1.06
Unidentified                      26           1.00
Unable to classify                24           1.12


D. Impact of System Configuration on Software Dependability

System configuration is an issue that demonstrates the
importance of considering the interactions between hardware,
software, and operations. Table X shows a breakdown of the
process pairs whose failures caused the 18 observed system
failures, based on their configurability. In the table, a
“Location-free” process pair is a pair that can be placed on any
two processors in the system, independent of hardware
configuration. The location of a nondisk or disk I/O process pair is
determined by hardware configuration. The failure of a
nondisk I/O or location-free process pair causes a system failure,
because the process pair executes on the two processors that
execute a disk process pair. Thus, a double processor halt
resulting from a failure of such a nondisk I/O or location-free
process pair would cause a set of disks to become inaccessible.


TABLE X

Configurability of Failed Process Pairs That Caused System Failures

Failed Process Pair            #System Failures
Location-free process pair            7
Nondisk I/O process pair              5
Disk I/O process pair                 2
Others                                4


Table X shows that the number of system failures could po- 
tentially be reduced by 67% (12 out of 18) by avoiding the 
overlap in location between disk process pairs and the failed 
nondisk I/O or location-free process pairs. This result demon- 
strates the importance of considering software dependability in 
the context of an overall system. 
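
The 67% figure follows directly from the Table X counts, as the short calculation below shows. The dictionary keys repeat the table's row labels; the variable names are illustrative.

```python
system_failures = {
    "Location-free process pair": 7,
    "Nondisk I/O process pair": 5,
    "Disk I/O process pair": 2,
    "Others": 4,
}

# Failures that a different placement of process pairs could have avoided.
avoidable = (system_failures["Location-free process pair"]
             + system_failures["Nondisk I/O process pair"])
total = sum(system_failures.values())
reduction_pct = round(100 * avoidable / total)

print(avoidable, total, reduction_pct)   # prints: 12 18 67
```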

VII. Conclusions

Based on field failure data collected from the Tandem 
GUARDIAN operating system, this paper discussed evaluation 
of the dependability of operational software. The software 
faults considered are major defects that result in processor 
failures and invoke backup processes to take over. The paper 
categorized the underlying causes of software failures, dis- 
cussed the significance of failure recurrence, and evaluated the 
effectiveness of the process pair technique in tolerating soft- 
ware faults. The paper built a model to describe the impact of 
faults in the GUARDIAN operating system on the reliability of 
an overall Tandem system. The model was used to evaluate the 
significance of key factors that determine software depend- 
ability and to identify areas for improvement. 

An analysis of the data showed that about 77% of processor 
failures that are initially considered due to software are con- 
firmed as software problems, 13% of them are probably soft- 
ware problems, and the rest are confirmed as non-software 
problems. The analysis showed that hardware fault tolerance 
buys SFT. Using process pairs in Tandem systems, which was 
originally intended for tolerating hardware faults, allows the 
system to tolerate about 75% of reported software faults that 
result in processor failures. The loose coupling between proc- 
essors, which results in the backup execution (the processor 
state and the sequence of events) being different from the 
original execution, is a major reason for the measured SFT. 
This shows that the checkpointing and restart technique can be 
used as a low-cost SFT strategy. The results indicated that the 
actual level of SFT achieved by the use of process pairs de- 
pends on the degree of difference in the processing environ- 
ment between the original and backup executions and on the 
proportion of subtle faults in the software. 

Over two-thirds (72%) of reported software failures in Tan- 
dem systems are recurrences of previously reported faults. The 
modeling, based on the data, showed that, in addition to reduc- 
ing the number of software faults, software dependability in 
Tandem systems can be enhanced by reducing the recurrence 
rate and by improving the robustness of process pairs and the 
system configuration. Omission of lines of source code and not 
providing routines to handle rare but legitimate operational 
scenarios are the most common types of software faults in the 
GUARDIAN operating system. These types of faults are also 
the major causes of software reliability loss. The investigation 
of the impact of system configuration on software dependabil- 
ity demonstrated the importance of considering software de- 
pendability in the context of an overall system. 

It is suggested that more measurements and analyses be 
conducted in the manner proposed here so that a wide range of 
information on the dependability of operational software is 
available. 